Abstract

In this paper we extend temporal difference 
policy evaluation algorithms to performance 
criteria that include the variance of the cumu- 
lative reward. Such criteria are useful for risk 
management, and are important in domains 
such as finance and process control. We propose
both TD(0) and LSTD(λ) variants with
linear function approximation, prove their 
convergence, and demonstrate their utility in 
a 4-dimensional continuous state space prob- 
lem. 

1. Introduction 

In both Reinforcement Learning (RL; Bertsekas & 
Tsitsiklis, 1996) and planning in Markov Decision Pro- 
cesses (MDPs; Puterman, 1994), the typical objective 
is to maximize the cumulative (possibly discounted) 
expected reward, denoted by J. In many applications, 
however, the decision maker is also interested in min- 
imizing some form of risk of the policy. By risk, we 
mean reward criteria that take into account not only 
the expected reward, but also some additional statis- 
tics of the total reward such as its variance, its Value 
at Risk, etc. (Luenberger, 1998). 

In this work we focus on risk measures that involve 
the variance of the cumulative reward, denoted by V. 
Typical performance criteria that fall under this defi- 
nition include 

(a) Maximize J s.t. V < c 

(b) Minimize V s.t. J > c 

(c) Maximize the Sharpe Ratio: $J/\sqrt{V}$

(d) Maximize $J - c\sqrt{V}$

The rationale behind our choice of risk measure is that 
these performance criteria, such as the Sharpe Ratio 
(Sharpe, 1966) mentioned above, are being used in 
practice. Moreover, it seems that human decision mak- 
ers understand how to use variance well, in comparison 
to exponential utility functions (Howard & Matheson, 
1972), which require determining a non-intuitive ex- 
ponent coefficient. 

A fundamental concept in RL is the value function:
the expected reward to go from a given state. Estimates
of the value function drive most RL algorithms,
and efficient methods for obtaining these estimates
have been a prominent area of research. In particular,
Temporal Difference (TD; Sutton & Barto, 1998)
based methods have been found suitable for problems
where the state space is large, requiring some sort of 
function approximation. TD methods enjoy theoreti- 
cal guarantees (Bertsekas, 2012; Lazaric et al., 2010) 
and empirical success (Tesauro, 1995), and are consid- 
ered the state of the art in policy evaluation. 

In this work we present a TD framework for estimating 
the variance of the reward to go. Our approach is 
based on the following key observation: the second 
moment of the reward to go, denoted by M, together 
with the value function J, obey a linear equation - 
similar to the Bellman equation that drives regular 
TD algorithms. By extending TD methods to jointly 
estimate J and M, we obtain a solution for estimating 
the variance, using the relation $V = M - J^2$.

We propose both a variant of Least Squares Temporal 
Difference (LSTD) (Boyan, 2002) and of TD(0) (Sut- 
ton & Barto, 1998) for jointly estimating J and M 
with a linear function approximation. For these algo- 
rithms, we provide convergence guarantees and error 
bounds. In addition, we introduce a novel approach 
for enforcing the approximate variance to be positive, 
through a constrained TD equation.

Finally, an empirical evaluation on a challenging continuous
maze domain highlights both the usefulness
of our approach, and the importance of the variance 
function in understanding the risk of a policy. 

This paper is organized as follows. In Section 2 we 
present our formal RL setup. In Section 3 we derive 
the fundamental equations for jointly approximating 
J and M, and discuss their properties. A solution 
to these equations may be obtained by simulation, 
through the use of TD algorithms, as presented in 
Section 4. In Section 5 we further extend the LSTD 
framework by forcing the approximated variance to be 
positive. Section 6 presents an empirical evaluation, 
and Section 7 concludes, and discusses future direc- 
tions. 

2. Framework and Background 

We consider a Stochastic Shortest Path (SSP) problem¹
(Bertsekas, 2012), where the environment is modeled
by an MDP in discrete time with a finite state set
$X = \{1, \dots, n\}$ and a terminal state $x^*$. A fixed policy
$\pi$ determines, for each $x \in X$, a stochastic transition
to a subsequent state $y \in \{X \cup x^*\}$ with probability
$P(y|x)$. We consider a deterministic and bounded reward
function $r : X \to \mathbb{R}$. We denote by $x_k$ the state
at time $k$, where $k = 0, 1, 2, \dots$.

A policy is said to be proper (Bertsekas, 2012) if there 
is a positive probability that the terminal state x* will 
be reached after at most n transitions, from any initial 
state. In this paper we make the following assumption 

Assumption 1. The policy $\pi$ is proper.

Let $\tau \triangleq \min\{k > 0 \mid x_k = x^*\}$ denote the first visit time
to the terminal state, and let the random variable $B$
denote the accumulated reward along the trajectory
until that time²

\[ B \triangleq \sum_{k=0}^{\tau-1} r(x_k). \]

In this work, we are interested in the mean-variance
tradeoff in $B$, represented by the value function

\[ J(x) \triangleq E[B \mid x_0 = x], \quad x \in X, \]

and the variance of the reward to go

\[ V(x) \triangleq \mathrm{Var}[B \mid x_0 = x], \quad x \in X. \]

¹This is also known as an episodic setup.

²We do not define the reward at the terminal state as it
is not relevant to our performance criteria. However, the
customary zero terminal reward may be assumed throughout
the paper.

We will find it convenient to define also the second
moment of the reward to go

\[ M(x) \triangleq E[B^2 \mid x_0 = x], \quad x \in X. \]

Our goal is to estimate $J(x)$ and $V(x)$ from trajectories
obtained by simulating the MDP with policy $\pi$.

3. Approximation of the Variance of 
the Reward To Go 

In this section we derive a projected equation method
for approximating $J(x)$ and $M(x)$ using linear function
approximation. The estimation of $V(x)$ will then
follow from the relation $V(x) = M(x) - J(x)^2$.

Our starting point is a system of equations for $J(x)$
and $M(x)$, first derived by Sobel (1982) for a discounted
infinite horizon case, and extended here to
the SSP case. Note that the equation for $J$ is the well
known Bellman equation for a fixed policy, and is independent
of the equation for $M$.

Proposition 2. The following equations hold for $x \in X$

\[ J(x) = r(x) + \sum_{y \in X} P(y|x) J(y), \]
\[ M(x) = r(x)^2 + 2 r(x) \sum_{y \in X} P(y|x) J(y) + \sum_{y \in X} P(y|x) M(y). \quad (1) \]

Furthermore, under Assumption 1 a unique solution
to (1) exists.

The proof is straightforward, and given in Appendix 
A. 
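As a minimal sketch of how the two linear equations in (1) can be solved directly when the model is small and known, the following NumPy snippet solves first for J and then for M, and recovers V = M - J^2. The function name and the one-state example (reward -1, self-transition with probability 0.5, termination otherwise) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def solve_mean_and_second_moment(P, r):
    """Solve equations (1) for J and M on the non-terminal states.

    P : (n, n) array with P[x, y] = P(y | x) among non-terminal states;
        rows may sum to less than 1, the missing mass goes to x*.
    r : (n,) array of rewards.
    """
    n = len(r)
    I = np.eye(n)
    # J = r + P J
    J = np.linalg.solve(I - P, r)
    # M = r^2 + 2 r (P J) + P M, which is linear in M once J is known
    M = np.linalg.solve(I - P, r**2 + 2.0 * r * (P @ J))
    V = M - J**2
    return J, M, V

# Hypothetical one-state chain: reward -1, stay w.p. 0.5, else terminate.
P = np.array([[0.5]])
r = np.array([-1.0])
J, M, V = solve_mean_and_second_moment(P, r)
```

In this example the number of steps until termination is geometric with parameter 0.5, so the mean of B is -2 and its variance is 2, which matches the values returned by the solver.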

At this point the reader may wonder why an equation 
for V is not presented. While such an equation may be 
derived, as was done in (Tamar et al., 2012), it is not 
linear. The linearity of (1) is the key to our approach. 
As we show in the next subsection, the solution to 
(1) may be expressed as the fixed point of a linear 
mapping in the joint space of J and M. We will then 
show that a projection of this mapping onto a linear 
feature space is contracting, thus allowing us to use 
existing TD theory to derive estimation algorithms for 
J and M. 

3.1. A Projected Fixed Point Equation on the 
Joint Space of J and M 

For the sequel we introduce the following vector notations.
We denote by $P \in \mathbb{R}^{n \times n}$ and $r \in \mathbb{R}^n$
the SSP transition matrix and reward vector, i.e.,
$P_{x,y} = P(y|x)$ and $r_x = r(x)$, where $x, y \in X$. Also,
we define $R \triangleq \mathrm{diag}(r)$.

For a vector $z \in \mathbb{R}^{2n}$ we let $z_J \in \mathbb{R}^n$ and $z_M \in \mathbb{R}^n$
denote its leading and ending $n$ components, respectively.
Thus, such a vector belongs to the joint space
of $J$ and $M$.

We define the mapping $T : \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ by

\[ [Tz]_J = r + P z_J, \]
\[ [Tz]_M = Rr + 2RP z_J + P z_M. \]

It may easily be verified that a fixed point of T is a 
solution to (1), and by Proposition 2 such a fixed point 
exists and is unique. 
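The joint mapping T can be sketched in a few lines. The snippet below applies T repeatedly to a stacked vector z = (z_J, z_M) on a hypothetical one-state chain (reward -1, self-transition probability 0.5); the function name `apply_T` and the example are assumptions made for illustration. In this example the plain iteration settles at the solution of (1).

```python
import numpy as np

def apply_T(z, P, r):
    """Apply the joint mapping T to z = (z_J, z_M)."""
    n = len(r)
    z_J, z_M = z[:n], z[n:]
    new_J = r + P @ z_J                            # [Tz]_J = r + P z_J
    new_M = r * r + 2.0 * r * (P @ z_J) + P @ z_M  # [Tz]_M = Rr + 2RP z_J + P z_M
    return np.concatenate([new_J, new_M])

# Hypothetical one-state chain: reward -1, stay w.p. 0.5, else terminate.
P = np.array([[0.5]])
r = np.array([-1.0])
z = np.zeros(2)
for _ in range(200):
    z = apply_T(z, P, r)
```

After the loop, `z` is numerically equal to (J, M) = (-2, 6), the fixed point of T for this chain.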

When the state space $X$ is large, a direct solution of (1)
is not feasible, even if $P$ may be accurately obtained.
A popular approach in this case is to approximate $J(x)$
by restricting it to a lower dimensional subspace, and
use simulation based TD algorithms to adjust the approximation
parameters (Bertsekas, 2012). In this paper
we extend this approach to the approximation of
$M(x)$ as well.

We consider a linear approximation architecture of the
form

\[ \tilde{J}(x) = \phi_J(x)^\top w_J, \]
\[ \tilde{M}(x) = \phi_M(x)^\top w_M, \quad (2) \]

where $w_J \in \mathbb{R}^{l_J}$ and $w_M \in \mathbb{R}^{l_M}$ are the approximation
parameter vectors, $\phi_J(x) \in \mathbb{R}^{l_J}$ and $\phi_M(x) \in \mathbb{R}^{l_M}$ are
state dependent features, and $(\cdot)^\top$ denotes the transpose
of a vector. The low dimensional subspaces are
therefore

\[ S_J = \{\Phi_J w \mid w \in \mathbb{R}^{l_J}\}, \]
\[ S_M = \{\Phi_M w \mid w \in \mathbb{R}^{l_M}\}, \]

where $\Phi_J$ and $\Phi_M$ are matrices whose rows are $\phi_J(x)^\top$
and $\phi_M(x)^\top$, respectively. We make the following
standard independence assumption on the features

Assumption 3. The matrix $\Phi_J$ has rank $l_J$ and the
matrix $\Phi_M$ has rank $l_M$.

As outlined earlier, our goal is to estimate $w_J$ and $w_M$
from simulated trajectories of the MDP. Thus, it is
constructive to consider projections onto $S_J$ and $S_M$
with respect to a norm that is weighted according to
the state occupancy in these trajectories.

For a trajectory $x_0, \dots, x_{\tau-1}$, where $x_0$ is drawn from
a fixed distribution $\zeta_0(x)$, and the states evolve according
to the MDP with policy $\pi$, define the state
occupancy probabilities

\[ q_t(x) \triangleq P(x_t = x), \quad x \in X, \quad t = 0, 1, \dots \]

and let

\[ q(x) \triangleq \sum_{t=0}^{\infty} q_t(x), \quad x \in X, \]
\[ Q \triangleq \mathrm{diag}(q). \]

We make the following assumption on the policy $\pi$ and
initial distribution $\zeta_0$

Assumption 4. Each state has a positive probability
of being visited, namely, $q(x) > 0$ for all $x \in X$.

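For vectors in $\mathbb{R}^n$, we introduce the weighted Euclidean
norm

\[ \|y\|_q = \sqrt{\sum_{i=1}^{n} q(i)\, y(i)^2}, \quad y \in \mathbb{R}^n, \]

and we denote by $\Pi_J$ and $\Pi_M$ the projections from
$\mathbb{R}^n$ onto the subspaces $S_J$ and $S_M$, respectively, with
respect to this norm. For $z \in \mathbb{R}^{2n}$ we denote by $\Pi$ the
projection of $z_J$ onto $S_J$ and $z_M$ onto $S_M$, namely³

\[ \Pi = \begin{pmatrix} \Pi_J & 0 \\ 0 & \Pi_M \end{pmatrix}. \quad (3) \]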
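For a small, known model, the occupancy weights and the weighted projection can be computed explicitly. The sketch below uses two standard facts: summing the state distributions over time gives $q^\top = \zeta_0^\top (I - P)^{-1}$ for a proper policy, and the projection onto the column space of a full-rank feature matrix in the $q$-weighted norm is $\Phi(\Phi^\top Q \Phi)^{-1}\Phi^\top Q$. The function names and the two-state example are assumptions made for illustration.

```python
import numpy as np

def occupancy(P, zeta0):
    """q(x) = sum_t P(x_t = x), i.e. q^T = zeta0^T (I - P)^{-1}."""
    n = P.shape[0]
    return np.linalg.solve((np.eye(n) - P).T, zeta0)

def projection_matrix(Phi, q):
    """Projection onto span(Phi) in the q-weighted Euclidean norm."""
    Q = np.diag(q)
    return Phi @ np.linalg.solve(Phi.T @ Q @ Phi, Phi.T @ Q)

# Hypothetical chain: state 0 -> state 1 -> terminal, started in state 0.
P = np.array([[0.0, 1.0],
              [0.0, 0.0]])
zeta0 = np.array([1.0, 0.0])
q = occupancy(P, zeta0)          # each state is visited exactly once

Phi = np.array([[1.0], [1.0]])   # a single constant feature
Pi = projection_matrix(Phi, q)   # averages the two entries of a vector
```

Stacking two such matrices block-diagonally, one built from the value features and one from the second moment features, gives the joint projection in (3).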



We are now ready to fully describe our approximation
scheme. We consider the projected fixed point equation

\[ z = \Pi T z, \quad (4) \]

and, letting $z^*$ denote its solution, propose the approximate
value function $\tilde{J} = z^*_J \in S_J$ and second moment
function $\tilde{M} = z^*_M \in S_M$.



We proceed to derive some properties of the projected
fixed point equation (4). We begin by stating a well
known result regarding the contraction properties of
the projected Bellman operator $\Pi_J T_J$, where $T_J y = r + Py$.
A proof can be found in (Bertsekas, 2012),
Proposition 7.1.1.

Lemma 5. Let Assumptions 1, 3, and 4 hold. Then,
there exists some norm $\|\cdot\|_J$ and some $\beta_J < 1$ such
that

\[ \|\Pi_J P y\|_J \le \beta_J \|y\|_J, \quad \forall y \in \mathbb{R}^n. \]

Similarly, there exists some norm $\|\cdot\|_M$ and some
$\beta_M < 1$ such that

\[ \|\Pi_M P y\|_M \le \beta_M \|y\|_M, \quad \forall y \in \mathbb{R}^n. \]



Next, we define a weighted norm on $\mathbb{R}^{2n}$.

Definition 6. For a vector $z \in \mathbb{R}^{2n}$ and a scalar
$0 < \alpha < 1$, the $\alpha$-weighted norm is

\[ \|z\|_\alpha = \alpha \|z_J\|_J + (1 - \alpha) \|z_M\|_M, \quad (5) \]

where the norms $\|\cdot\|_J$ and $\|\cdot\|_M$ are defined in Lemma
5.

³The projection operators $\Pi_J$ and $\Pi_M$ are linear, and
may be written explicitly as $\Pi_J = \Phi_J (\Phi_J^\top Q \Phi_J)^{-1} \Phi_J^\top Q$,
and similarly for $\Pi_M$.

Our main result of this section is given in the following
lemma, where we show that the projected operator $\Pi T$
is a contraction with respect to the $\alpha$-weighted norm.

Lemma 7. Let Assumptions 1, 3, and 4 hold. Then,
there exists some $0 < \alpha < 1$ and some $\beta < 1$ such that
$\Pi T$ is a $\beta$-contraction with respect to the $\alpha$-weighted
norm, i.e.,

\[ \|\Pi T z\|_\alpha \le \beta \|z\|_\alpha, \quad \forall z \in \mathbb{R}^{2n}. \]

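Proof. Let $\mathcal{P}$ denote the following matrix in $\mathbb{R}^{2n \times 2n}$

\[ \mathcal{P} = \begin{pmatrix} P & 0 \\ 2RP & P \end{pmatrix}, \]

and let $z \in \mathbb{R}^{2n}$. Since $Tz$ is affine in $z$ with linear
part $\mathcal{P}z$, we need to show that

\[ \|\Pi \mathcal{P} z\|_\alpha \le \beta \|z\|_\alpha. \]

From (3) we have

\[ \Pi \mathcal{P} = \begin{pmatrix} \Pi_J P & 0 \\ 2\Pi_M R P & \Pi_M P \end{pmatrix}. \]

Therefore, we have

\[
\begin{aligned}
\|\Pi \mathcal{P} z\|_\alpha &= \alpha \|\Pi_J P z_J\|_J + (1 - \alpha) \|2\Pi_M R P z_J + \Pi_M P z_M\|_M \\
&\le \alpha \|\Pi_J P z_J\|_J + (1 - \alpha) \|\Pi_M P z_M\|_M + (1 - \alpha) \|2\Pi_M R P z_J\|_M \\
&\le \alpha \beta_J \|z_J\|_J + (1 - \alpha) \beta_M \|z_M\|_M + (1 - \alpha) \|2\Pi_M R P z_J\|_M,
\end{aligned}
\quad (6)
\]

where the equality is by definition of the $\alpha$-weighted
norm (5), the first inequality is from the triangle inequality,
and the second inequality is by Lemma 5.
Now, we claim that there exists some finite $C$ such
that

\[ \|2\Pi_M R P y\|_M \le C \|y\|_J, \quad \forall y \in \mathbb{R}^n. \quad (7) \]

To see this, note that since $\mathbb{R}^n$ is a finite dimensional
real vector space, all vector norms are equivalent (Horn
& Johnson, 1985), therefore there exist finite $C_1$ and $C_2$
such that for all $y \in \mathbb{R}^n$

\[ C_1 \|2\Pi_M R P y\|_2 \le \|2\Pi_M R P y\|_M \le C_2 \|2\Pi_M R P y\|_2, \]

where $\|\cdot\|_2$ denotes the Euclidean norm. Let $\lambda$ denote
the spectral norm of the matrix $2\Pi_M R P$, which
is finite since all the matrix elements are finite. We
have

\[ \|2\Pi_M R P y\|_2 \le \lambda \|y\|_2, \quad \forall y \in \mathbb{R}^n. \]

Using again the fact that all vector norms are equivalent,
there exists a finite $C_3$ such that

\[ \|y\|_2 \le C_3 \|y\|_J, \quad \forall y \in \mathbb{R}^n. \]

Setting $C = C_2 \lambda C_3$ we get the desired bound. Let
$\tilde{\beta} = \max\{\beta_J, \beta_M\} < 1$, and choose $\epsilon > 0$ such that

\[ \tilde{\beta} + \epsilon < 1. \]

Now, choose $\alpha$ such that

\[ \alpha = \frac{C}{\epsilon + C}. \]

We have that

\[ (1 - \alpha) C = \alpha \epsilon, \]

and plugging in (7)

\[ (1 - \alpha) \|2\Pi_M R P y\|_M \le \alpha \epsilon \|y\|_J. \]

Plugging in (6) we have

\[
\begin{aligned}
&\alpha \beta_J \|z_J\|_J + (1 - \alpha) \beta_M \|z_M\|_M + (1 - \alpha) \|2\Pi_M R P z_J\|_M \\
&\quad \le \alpha \tilde{\beta} \|z_J\|_J + (1 - \alpha) \tilde{\beta} \|z_M\|_M + \alpha \epsilon \|z_J\|_J \\
&\quad \le (\tilde{\beta} + \epsilon) \left( \alpha \|z_J\|_J + (1 - \alpha) \|z_M\|_M \right),
\end{aligned}
\]

and therefore

\[ \|\Pi \mathcal{P} z\|_\alpha \le (\tilde{\beta} + \epsilon) \|z\|_\alpha. \]

Finally, choose $\beta = \tilde{\beta} + \epsilon$. □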

Lemma 7 guarantees that the projected operator $\Pi T$
has a unique fixed point. Let us denote this fixed
point by $z^*$, and let $w^*_J, w^*_M$ denote the corresponding
weights, which are unique due to Assumption 3

\[ \Pi T z^* = z^*, \]
\[ z^*_J = \Phi_J w^*_J, \]
\[ z^*_M = \Phi_M w^*_M. \quad (8) \]



In the next lemma we provide a bound on the approx- 
imation error. The proof is in Appendix B. 

Lemma 8. Let Assumptions 1, 3, and 4 hold. Denote
by $z_{true} \in \mathbb{R}^{2n}$ the true value and second moment
functions, i.e., $z_{true}$ satisfies $z_{true} = T z_{true}$. Then,

\[ \|z_{true} - z^*\|_\alpha \le \frac{1}{1 - \beta} \|z_{true} - \Pi z_{true}\|_\alpha, \]

with $\alpha$ and $\beta$ defined in Lemma 7.

4. Simulation Based Estimation
Algorithms

We now use the theoretical results of the previous
subsection to derive simulation based algorithms for
jointly estimating the value function and second moment.
The projected equation (8) is linear, and can
be written in matrix form as follows. First let us write
the equation explicitly as

\[ \Pi_J \left( r + P \Phi_J w^*_J \right) = \Phi_J w^*_J, \]
\[ \Pi_M \left( Rr + 2RP \Phi_J w^*_J + P \Phi_M w^*_M \right) = \Phi_M w^*_M. \quad (9) \]

Projecting a vector $y$ onto $\Phi w$ satisfies the following
orthogonality condition

\[ \Phi^\top Q (y - \Phi w) = 0, \]

therefore we have

\[ \Phi_J^\top Q \left( \Phi_J w^*_J - (r + P \Phi_J w^*_J) \right) = 0, \]
\[ \Phi_M^\top Q \left( \Phi_M w^*_M - (Rr + 2RP \Phi_J w^*_J + P \Phi_M w^*_M) \right) = 0, \]

which can be written as

\[ A w^*_J = b, \]
\[ C w^*_M = d, \quad (10) \]

with

\[ A = \Phi_J^\top Q (I - P) \Phi_J, \quad b = \Phi_J^\top Q r, \]
\[ C = \Phi_M^\top Q (I - P) \Phi_M, \quad d = \Phi_M^\top Q R \left( r + 2P \Phi_J A^{-1} b \right), \quad (11) \]

and the matrices $A$ and $C$ are invertible since Lemma
7 guarantees a unique solution to (8) and Assumption
3 guarantees the unique weights of its projection.

4.1. A Least Squares TD Algorithm

Our first simulation based algorithm is an extension
of the Least Squares Temporal Difference (LSTD) algorithm
(Boyan, 2002). We simulate $N$ trajectories
of the MDP with the policy $\pi$ and initial state distribution
$\zeta_0$. Let $x_0^k, x_1^k, \dots, x^k_{\tau^k - 1}$ and $\tau^k$, where
$k = 1, \dots, N$, denote the state sequence and visit
times to the terminal state within these trajectories,
respectively. We now use these trajectories to form the
following estimates of the terms in (11)

\[
\begin{aligned}
A_N &= E_N \left[ \sum_{t=0}^{\tau-1} \phi_J(x_t) \left( \phi_J(x_t) - \phi_J(x_{t+1}) \right)^\top \right], \\
b_N &= E_N \left[ \sum_{t=0}^{\tau-1} \phi_J(x_t) r(x_t) \right], \\
C_N &= E_N \left[ \sum_{t=0}^{\tau-1} \phi_M(x_t) \left( \phi_M(x_t) - \phi_M(x_{t+1}) \right)^\top \right], \\
d_N &= E_N \left[ \sum_{t=0}^{\tau-1} \phi_M(x_t) r(x_t) \left( r(x_t) + 2 \phi_J(x_{t+1})^\top A_N^{-1} b_N \right) \right],
\end{aligned}
\quad (12)
\]

where $E_N$ denotes an empirical average over trajectories,
i.e., $E_N[f(x, \tau)] = \frac{1}{N} \sum_{k=1}^{N} f(x^k, \tau^k)$. The LSTD
approximation is given by

\[ \hat{w}_J = A_N^{-1} b_N, \quad \hat{w}_M = C_N^{-1} d_N. \]

The next theorem shows that the LSTD approximation
converges.

Theorem 9. Let Assumptions 1, 3, and 4 hold. Then
$\hat{w}_J \to w^*_J$ and $\hat{w}_M \to w^*_M$ as $N \to \infty$ with probability
1.

The proof involves a straightforward application of the
law of large numbers and is described in Appendix C.
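The sample averages in (12) translate fairly directly into code. The following is a minimal sketch rather than the authors' implementation: it assumes that episodes are given as lists of non-terminal states with their rewards, that the features of the terminal state are zero, and that the estimated matrices are invertible. The function name and data layout are hypothetical.

```python
import numpy as np

def lstd_mean_and_second_moment(trajectories, phi_J, phi_M):
    """Estimate (w_J, w_M) from episodes via the sample averages in (12).

    trajectories : list of (states, rewards) pairs, one per episode, listing
                   the non-terminal states x_0..x_{tau-1} and their rewards.
    phi_J, phi_M : callables mapping a state to a 1-D feature array.
    Assumption: the features of the terminal state are zero.
    """
    x0 = trajectories[0][0][0]
    l_J, l_M = len(phi_J(x0)), len(phi_M(x0))
    N = len(trajectories)

    def steps():
        # Yield phi_J(x_t), phi_J(x_{t+1}), phi_M(x_t), phi_M(x_{t+1}), r(x_t).
        for states, rewards in trajectories:
            for t, (x, rew) in enumerate(zip(states, rewards)):
                last = t == len(states) - 1
                fJ_next = np.zeros(l_J) if last else phi_J(states[t + 1])
                fM_next = np.zeros(l_M) if last else phi_M(states[t + 1])
                yield phi_J(x), fJ_next, phi_M(x), fM_next, rew

    A = np.zeros((l_J, l_J))
    b = np.zeros(l_J)
    C = np.zeros((l_M, l_M))
    for fJ, fJ_next, fM, fM_next, rew in steps():
        A += np.outer(fJ, fJ - fJ_next) / N
        b += fJ * rew / N
        C += np.outer(fM, fM - fM_next) / N
    w_J = np.linalg.solve(A, b)

    # d_N needs the value weights, so it is accumulated in a second pass.
    d = np.zeros(l_M)
    for fJ, fJ_next, fM, fM_next, rew in steps():
        d += fM * rew * (rew + 2.0 * (fJ_next @ w_J)) / N
    w_M = np.linalg.solve(C, d)
    return w_J, w_M

# Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal, reward -1 per step,
# with tabular (one-hot) features for both J and M.
one_hot = lambda x: np.eye(3)[x]
w_J, w_M = lstd_mean_and_second_moment(
    [([0, 1, 2], [-1.0, -1.0, -1.0])], one_hot, one_hot)
```

Because this chain is deterministic and the features are tabular, the estimated variance `w_M - w_J**2` is zero in every state, which gives a quick sanity check of the second moment pass.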



4.2. An online TD(0) Algorithm 

Our second estimation algorithm is an extension of
the well known TD(0) algorithm (Sutton & Barto,
1998). Again, we simulate trajectories of the MDP
corresponding to the policy $\pi$ and initial state distribution
$\zeta_0$, and we iteratively update our estimates at
every visit to the terminal state⁴. For some $0 \le t < \tau^k$
and weights $w_J, w_M$, we introduce the TD terms

\[ \delta_J^k(t, w_J, w_M) = r(x_t^k) + \left( \phi_J(x_{t+1}^k)^\top - \phi_J(x_t^k)^\top \right) w_J, \]
\[ \delta_M^k(t, w_J, w_M) = r^2(x_t^k) + 2 r(x_t^k) \phi_J(x_{t+1}^k)^\top w_J + \left( \phi_M(x_{t+1}^k)^\top - \phi_M(x_t^k)^\top \right) w_M. \]

Note that $\delta_J^k$ is the standard TD error (Sutton &
Barto, 1998). The TD(0) update is given by

\[ \hat{w}_{J;k+1} = \hat{w}_{J;k} + \xi_k \sum_{t=0}^{\tau^k - 1} \phi_J(x_t^k) \delta_J^k(t, \hat{w}_{J;k}, \hat{w}_{M;k}), \]
\[ \hat{w}_{M;k+1} = \hat{w}_{M;k} + \xi_k \sum_{t=0}^{\tau^k - 1} \phi_M(x_t^k) \delta_M^k(t, \hat{w}_{J;k}, \hat{w}_{M;k}), \]

where $\{\xi_k\}$ are positive step sizes.

⁴An extension to an algorithm that updates at every
state transition is also possible, but we do not pursue such
here.

The next theorem shows that the TD(0) algorithm 
converges. 

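Theorem 10. Let Assumptions 1, 3, and 4 hold, and
let the step sizes satisfy

\[ \sum_{k=0}^{\infty} \xi_k = \infty, \quad \sum_{k=0}^{\infty} \xi_k^2 < \infty. \]

Then $\hat{w}_{J;k} \to w^*_J$ and $\hat{w}_{M;k} \to w^*_M$ as $k \to \infty$ with
probability 1.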

The proof, provided in Appendix D, is based on repre- 
senting the TD(0) algorithm as a stochastic approxi- 
mation and using contraction properties similar to the 
ones of the previous section to prove convergence. 
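A per-episode TD(0) update for the pair of weight vectors can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the function name and data layout are hypothetical, terminal-state features are taken to be zero, and the constant step size in the usage example is chosen only because the example chain is deterministic (Theorem 10 requires decaying step sizes in general).

```python
import numpy as np

def td0_episode_update(w_J, w_M, states, rewards, phi_J, phi_M, step):
    """One TD(0) update of (w_J, w_M) from a single finished episode.

    states, rewards : non-terminal states x_0..x_{tau-1} and their rewards.
    Assumption: the features of the terminal state are zero.
    """
    sum_J = np.zeros_like(w_J)
    sum_M = np.zeros_like(w_M)
    for t, (x, rew) in enumerate(zip(states, rewards)):
        last = t == len(states) - 1
        fJ, fM = phi_J(x), phi_M(x)
        fJ_next = np.zeros_like(w_J) if last else phi_J(states[t + 1])
        fM_next = np.zeros_like(w_M) if last else phi_M(states[t + 1])
        # TD term for the value function (the standard TD error).
        delta_J = rew + (fJ_next - fJ) @ w_J
        # TD term for the second moment; note the coupling through w_J.
        delta_M = rew**2 + 2.0 * rew * (fJ_next @ w_J) + (fM_next - fM) @ w_M
        sum_J += fJ * delta_J
        sum_M += fM * delta_M
    # Both sums use the old weights, matching the joint update.
    return w_J + step * sum_J, w_M + step * sum_M

# Hypothetical deterministic chain 0 -> 1 -> terminal with reward -1 per step.
one_hot = lambda x: np.eye(2)[x]
episode = ([0, 1], [-1.0, -1.0])
first_J, first_M = td0_episode_update(
    np.zeros(2), np.zeros(2), *episode, one_hot, one_hot, 0.5)

w_J, w_M = np.zeros(2), np.zeros(2)
for _ in range(200):
    w_J, w_M = td0_episode_update(w_J, w_M, *episode, one_hot, one_hot, 0.5)
```

On this chain the iterates approach J = (-2, -1) and M = (4, 1), so the implied variance is zero, as expected for a deterministic reward to go.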

4.3. Multistep Algorithms 

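A common method in value function approximation is
to replace the single step mapping $T_J$ with a multistep
version of the form

\[ T_J^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l T_J^{l+1}, \]

with $0 < \lambda < 1$. The projected equation (9) then
becomes

\[ \Pi_J T_J^{(\lambda)} \left( \Phi_J w_J^{*(\lambda)} \right) = \Phi_J w_J^{*(\lambda)}. \]

Similarly, we may write a multistep equation for $M$

\[ \Pi_M T_{M*}^{(\lambda)} \left( \Phi_M w_M^{*(\lambda)} \right) = \Phi_M w_M^{*(\lambda)}, \quad (13) \]

where

\[ T_{M*}^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l T_{M*}^{l+1}, \]

and

\[ T_{M*}(y) = Rr + 2RP \Phi_J w_J^{*(\lambda)} + Py. \]

Note the difference between $T_{M*}$ and $T_M$ defined earlier;
we are no longer working on the joint space of $J$
and $M$ but instead we have an independent equation
for approximating $J$, and its solution $w_J^{*(\lambda)}$ is part of
equation (13) for approximating $M$. By Proposition
7.1.1 of (Bertsekas, 2012) both $\Pi_J T_J^{(\lambda)}$ and $\Pi_M T_{M*}^{(\lambda)}$
are contractions with respect to the weighted norm
$\|\cdot\|_q$, therefore both multistep projected equations admit
a unique solution. In a similar manner to the single
step version, the projected equations may be written
in matrix form

\[ A^{(\lambda)} w_J^{*(\lambda)} = b^{(\lambda)}, \]
\[ C^{(\lambda)} w_M^{*(\lambda)} = d^{(\lambda)}, \quad (14) \]

where

\[ A^{(\lambda)} = \Phi_J^\top Q \left( I - P^{(\lambda)} \right) \Phi_J, \quad b^{(\lambda)} = \Phi_J^\top Q (I - \lambda P)^{-1} r, \]
\[ C^{(\lambda)} = \Phi_M^\top Q \left( I - P^{(\lambda)} \right) \Phi_M, \quad d^{(\lambda)} = \Phi_M^\top Q (I - \lambda P)^{-1} R \left( r + 2P \Phi_J w_J^{*(\lambda)} \right), \]

and

\[ P^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l P^{l+1}. \]

Simulation based estimates $A_N^{(\lambda)}$ and $b_N^{(\lambda)}$ of the expressions
above may be obtained by the use of eligibility
traces, as described in (Bertsekas, 2012), and
the LSTD(λ) approximation is then given by
$\hat{w}_J^{(\lambda)} = \left( A_N^{(\lambda)} \right)^{-1} b_N^{(\lambda)}$.
By substituting $w_J^{*(\lambda)}$ with $\hat{w}_J^{(\lambda)}$ in the
expression for $d^{(\lambda)}$, a similar procedure may be used
to derive estimates $C_N^{(\lambda)}$ and $d_N^{(\lambda)}$, and to obtain the
LSTD(λ) approximation
$\hat{w}_M^{(\lambda)} = \left( C_N^{(\lambda)} \right)^{-1} d_N^{(\lambda)}$. Due
to the similarity to the LSTD procedure in (12), the
exact details are omitted.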
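When the model is small and known, the multistep matrix equations can be formed explicitly instead of being estimated with eligibility traces. The sketch below relies on the geometric series identities $(1-\lambda)\sum_l \lambda^l P^{l+1} = (1-\lambda) P (I - \lambda P)^{-1}$ and $\sum_l \lambda^l P^l = (I - \lambda P)^{-1}$; the function name and the one-state example are assumptions made for illustration, and this model-based computation is a substitute for, not a description of, the simulation based LSTD(λ) procedure.

```python
import numpy as np

def multistep_weights(P, r, Phi_J, Phi_M, q, lam):
    """Solve the multistep projected equations for (w_J, w_M) from a known model."""
    n = P.shape[0]
    I = np.eye(n)
    Q = np.diag(q)
    R = np.diag(r)
    resolvent = np.linalg.inv(I - lam * P)     # sum_l lam^l P^l
    P_lam = (1.0 - lam) * P @ resolvent        # (1 - lam) sum_l lam^l P^(l+1)

    A = Phi_J.T @ Q @ (I - P_lam) @ Phi_J
    b = Phi_J.T @ Q @ resolvent @ r
    w_J = np.linalg.solve(A, b)

    C = Phi_M.T @ Q @ (I - P_lam) @ Phi_M
    d = Phi_M.T @ Q @ resolvent @ R @ (r + 2.0 * P @ Phi_J @ w_J)
    w_M = np.linalg.solve(C, d)
    return w_J, w_M

# Hypothetical one-state chain: reward -1, stay w.p. 0.5, else terminate.
P = np.array([[0.5]])
r = np.array([-1.0])
Phi = np.array([[1.0]])      # tabular feature
q = np.array([2.0])          # expected number of visits when starting in the state
w_J, w_M = multistep_weights(P, r, Phi, Phi, q, lam=0.9)
```

With tabular features the projection is the identity, so the multistep solution coincides with the exact mean and second moment of this chain (-2 and 6) for any choice of λ in (0, 1).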



5. Positive Variance as a Constraint in 
LSTD 

The TD algorithms of the preceding section approximated
$J$ and $M$ by the solution to the fixed point
equation (8). While Lemma 8 provides us a bound on
the approximation error of $\tilde{J}$ and $\tilde{M}$ measured in the
$\alpha$-weighted norm, it does not guarantee that the approximated
variance $\tilde{V}$, given by $\tilde{M} - \tilde{J}^2$, is positive
for all states. If we are estimating $M$ as a means to infer
$V$, it may be useful to include our prior knowledge
that $V \ge 0$ in the estimation process. In this section
we propose to enforce this knowledge as a constraint
in the projected fixed point equation.

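The multistep equation for the second moment weights
(13) may be written with the projection operator as
an explicit minimization

\[ w_M^{*(\lambda)} = \arg\min_w \left\| \Phi_M w - \left( \tilde{r} + P^{(\lambda)} \Phi_M w_M^{*(\lambda)} \right) \right\|_q, \]

with $P^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l P^{l+1}$
and

\[ \tilde{r} = (I - \lambda P)^{-1} \left( Rr + 2RP \Phi_J w_J^{*(\lambda)} \right). \]

Requiring non negative variance in some state $x$ may
thus be written as a linear constraint in $w_M^{*(\lambda)}$

\[ \phi_M(x)^\top w_M^{*(\lambda)} - \left( \phi_J(x)^\top w_J^{*(\lambda)} \right)^2 \ge 0. \]

Let $\{x_1, \dots, x_l\}$ denote a set of states in which we demand
that the variance be non negative. Let $H \in \mathbb{R}^{l \times l_M}$
denote a matrix with the features $-\phi_M^\top(x_i)$ as
its rows, and let $g \in \mathbb{R}^l$ denote a vector with elements
$-\left( \phi_J(x_i)^\top w_J^{*(\lambda)} \right)^2$. We can write the variance-constrained
projected equation for the second moment
as

\[
w_M^{vc} =
\begin{cases}
\arg\min_w \left\| \Phi_M w - \left( \tilde{r} + P^{(\lambda)} \Phi_M w_M^{vc} \right) \right\|_q \\
\text{s.t. } Hw \le g
\end{cases}
\quad (15)
\]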

The following assumption guarantees that the constraints
in (15) admit a feasible solution.

Assumption 11. There exists $w$ such that $Hw < g$.

Note that a simple way to satisfy Assumption 11 is to
have some feature vector that is positive for all states.
Equation (15) is a form of projected equation studied
in (Bertsekas, 2011), the solution of which may be
obtained by the following iterative procedure

\[ w_{k+1} = \Pi_{\Xi, W_M} \left[ w_k - \gamma \Xi^{-1} \left( C^{(\lambda)} w_k - d^{(\lambda)} \right) \right], \quad (16) \]

where $\Xi$ is some positive definite matrix, and $\Pi_{\Xi, W_M}$
denotes a projection onto the convex set $W_M =
\{w \mid Hw \le g\}$ with respect to the $\Xi$ weighted Euclidean
norm. The following lemma, which is based on a convergence
result of (Bertsekas, 2011), guarantees that
algorithm (16) converges.

Lemma 12. Assume $\lambda > 0$. Then there exists $\bar{\gamma} > 0$
such that $\forall \gamma \in (0, \bar{\gamma})$ the algorithm (16) converges at
a linear rate to $w_M^{vc}$.

Proof. This is a direct application of the convergence
result in (Bertsekas, 2011). The only nontrivial assumption
that needs to be verified is that $T_{M*}^{(\lambda)}$ is a
contraction in the $\|\cdot\|_q$ norm (Proposition 1 in Bertsekas,
2011). For $\lambda > 0$, Proposition 7.1.1 of (Bertsekas,
2012) guarantees that $T_{M*}^{(\lambda)}$ is indeed contracting
in the $\|\cdot\|_q$ norm. □
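The structure of the projected iteration can be illustrated in a deliberately simplified setting. The sketch below assumes the weighting matrix is the identity and that there is a single constraint, so that the projection onto the feasible set has a closed form; with several constraints the projection step would generally require a quadratic program solver. The function names, the matrix `C`, the vector `d`, and the constraint are all hypothetical toy values, not quantities from the paper.

```python
import numpy as np

def project_halfspace(w, h, g):
    """Euclidean projection of w onto the half-space {w : h^T w <= g}."""
    violation = h @ w - g
    if violation <= 0.0:
        return w
    return w - violation * h / (h @ h)

def constrained_iteration(C, d, h, g, gamma, num_iters, w0):
    """Iterate w <- Proj[w - gamma (C w - d)] with an identity weighting matrix."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = project_halfspace(w - gamma * (C @ w - d), h, g)
    return w

# Toy problem: the unconstrained solution of C w = d is (2, 0),
# but the single constraint w[0] <= 1 is active at the limit point.
C = np.eye(2)
d = np.array([2.0, 0.0])
h = np.array([1.0, 0.0])
w = constrained_iteration(C, d, h, g=1.0, gamma=0.5, num_iters=100, w0=np.zeros(2))
```

In the variance-constrained setting, `h` and `g` would be built from a row of H and the matching element of g, so that the limit point respects the non negativity of the approximated variance at the chosen state.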

We illustrate the effect of the positive variance constraint
in a simple example. Consider the Markov
chain depicted in Figure 1, which consists of $N$ states
with reward $-1$ and a terminal state $x^*$ with zero reward.
The transition from each state is either to a
subsequent state (with probability $p$) or to a preceding
state (with probability $1 - p$), with the exception of
the first state, which transitions to itself instead. We
chose to approximate $J$ and $M$ with polynomials of
degree 1 and 2, respectively. For such a small problem
the fixed point equation (14) may be solved exactly,
yielding the approximation depicted in Figure 2 (dotted
line), for $p = 0.7$, $N = 30$, and $\lambda = 0.95$. Note
that the variance is negative for the last two states.
Using algorithm (16) we obtained a positive variance
constrained approximation, which is depicted in Figure
2 (dashed line). Note that the variance is now positive
for all states (as was required by the constraints).




Figure 1. A Markov chain 



6. Experiments 

In this section we present numerical simulations of pol- 
icy evaluation on a challenging continuous maze do- 
main. The goal of this presentation is twofold; first, we 
show that the variance function may be estimated suc- 
cessfully on a large domain using a reasonable amount 
of samples. Second, the intuitive maze domain high- 
lights the information that may be gleaned from the 
variance function. We begin by describing the domain 
and then present our policy evaluation results. 

The Pinball Domain (Konidaris & Barto, 2009) is 
a continuous 2-dimensional maze where a small ball 
needs to be maneuvered between obstacles to reach 
some target area, as depicted in Figure 3 (left). The
ball is controlled by applying a constant force in one 
of the 4 directions at each time step, which causes 
acceleration in the respective direction. In addition, 
the ball's velocity is susceptible to additive Gaussian 
noise (zero mean, standard deviation 0.03) and friction 
(drag coefficient 0.995). The state of the ball is thus 4-dimensional
$(x, y, \dot{x}, \dot{y})$, and the action set is discrete,
with 4 available controls. The obstacles are sharply 
shaped and fully elastic, and collisions cause the ball 
to bounce. As noted in (Konidaris & Barto, 2009), 
the sharp obstacles and continuous dynamics make the 
pinball domain more challenging for RL than simple 
navigation tasks or typical benchmarks like Acrobot. 

A Java implementation of the pinball domain used in
(Konidaris & Barto, 2009) is available on-line⁵ and
was used for our simulations as well, with the addition
of noise to the velocity.

We obtained a near-optimal policy using SARSA (Sutton
& Barto, 1998) with radial basis function features
and a reward of -1 for all states until reaching the target.
The value function for this policy is plotted in
Figure 3, for states with zero velocity. As should be
expected, the value is approximately a linear function
of the distance to the target.

⁵http://people.csail.mit.edu/gdk/software.html

Figure 2. Value, second moment and variance approximation

Using 3000 trajectories (starting from uniformly distributed
random states in the maze) we estimated the
value and second moment functions by the LSTD(λ)
algorithm described above. We used uniform tile coding
as features (50 × 50 non-overlapping tiles in x and
y, no dependence on velocity) and set λ = 0.9. The resulting
estimated standard deviation function is shown
in Figure 4 (left). In comparison, the standard deviation
function shown in Figure 4 (right) was estimated
by the naive sample variance, and required 500 trajectories
from each point, for a total of 1,250,000 trajectories.

Note that the variance function is clearly not a linear 
function of the distance to the target, and in some 
places not even monotone. Furthermore, we see that 
an area in the top part of the maze before the first turn 
is very risky, even more than the farthest point from 
the target. We stress that this information cannot be 
gleaned from inspecting the value function alone. 

7. Conclusion 

This work presented a novel framework for policy eval- 
uation in RL with variance related performance crite- 
ria. We presented both formal guarantees and empir- 
ical evidence that this approach is useful in problems 
with a large state space. 

A few issues are in need of further investigation. First, 
we note a possible extension to other risk measures 
such as the percentile criterion (Delage & Mannor, 
2010). In a recent work, Morimura et al. (2012) derived
Bellman equations for the distribution of the total
return, and appropriate TD learning rules were
proposed, albeit without function approximation and 
formal guarantees. 



More importantly, at the moment it remains unclear 
how the variance function may be used for policy opti- 
mization. While a naive policy improvement step may 
be performed, its usefulness should be questioned, as 
it was shown to be problematic for the standard deviation
adjusted reward (Sobel, 1982) and the variance
constrained reward (Mannor & Tsitsiklis, 2011). In
(Tamar et al., 2012), a policy gradient approach was
proposed for handling variance related criteria, which 
may be extended to an actor-critic method by using 
the variance function presented here.

Figure 3. The pinball domain

Figure 4. Standard Deviation of Reward To Go

Supplementary Material

A. Proof of Proposition 2 

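Proof. The equation for $J(x)$ is well-known, and its proof is given here only for completeness. Choose $x \in X$.
Then,

\[
\begin{aligned}
J(x) &= E[B \mid x_0 = x] \\
&= E\left[ \sum_{k=0}^{\tau-1} r(x_k) \,\middle|\, x_0 = x \right] \\
&= r(x) + E\left[ \sum_{k=1}^{\tau-1} r(x_k) \,\middle|\, x_0 = x \right] \\
&= r(x) + E\left[ E\left[ \sum_{k=1}^{\tau-1} r(x_k) \,\middle|\, x_0 = x, x_1 = y \right] \right] \\
&= r(x) + \sum_{y \in X} P(y|x) J(y),
\end{aligned}
\]

where we excluded the terminal state from the sum since reaching it ends the trajectory.
Similarly,

\[
\begin{aligned}
M(x) &= E[B^2 \mid x_0 = x] \\
&= E\left[ \left( \sum_{k=0}^{\tau-1} r(x_k) \right)^2 \,\middle|\, x_0 = x \right] \\
&= E\left[ \left( r(x_0) + \sum_{k=1}^{\tau-1} r(x_k) \right)^2 \,\middle|\, x_0 = x \right] \\
&= r(x)^2 + 2 r(x) E\left[ \sum_{k=1}^{\tau-1} r(x_k) \,\middle|\, x_0 = x \right] + E\left[ \left( \sum_{k=1}^{\tau-1} r(x_k) \right)^2 \,\middle|\, x_0 = x \right] \\
&= r(x)^2 + 2 r(x) \sum_{y \in X} P(y|x) J(y) + \sum_{y \in X} P(y|x) M(y).
\end{aligned}
\]

The uniqueness of the value function $J$ for a proper policy is well known, cf. Proposition 3.2.1 in (Bertsekas,
2012). The uniqueness of $M$ follows by observing that in the equation for $M$, $M$ may be seen as the value
function of an MDP with the same transitions but with reward $r(x)^2 + 2r(x) \sum_{y \in X} P(y|x) J(y)$. Since only the
rewards change, the policy remains proper and Proposition 3.2.1 in (Bertsekas, 2012) applies. □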

B. Proof of Lemma 8 



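Proof. We have

\[
\begin{aligned}
\|z_{true} - z^*\|_\alpha &\le \|z_{true} - \Pi z_{true}\|_\alpha + \|\Pi z_{true} - z^*\|_\alpha \\
&= \|z_{true} - \Pi z_{true}\|_\alpha + \|\Pi T z_{true} - \Pi T z^*\|_\alpha \\
&\le \|z_{true} - \Pi z_{true}\|_\alpha + \beta \|z_{true} - z^*\|_\alpha,
\end{aligned}
\]

where the equality uses $z_{true} = T z_{true}$ and $z^* = \Pi T z^*$, and the last inequality uses the
contraction property of Lemma 7. Rearranging gives the stated result. □

C. Proof of Theorem 9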

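Proof. Let $\phi_1(x), \phi_2(x)$ be some vector functions of the state. We claim that

\[ E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_t)^\top \right] = \sum_x q(x) \phi_1(x) \phi_2(x)^\top. \]

To see this, let $1(\cdot)$ denote the indicator function and write

\[
\begin{aligned}
E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_t)^\top \right]
&= E\left[ \sum_{t=0}^{\tau-1} \sum_x \phi_1(x) \phi_2(x)^\top 1(x_t = x) \right] \\
&= E\left[ \sum_x \phi_1(x) \phi_2(x)^\top \sum_{t=0}^{\tau-1} 1(x_t = x) \right] \\
&= \sum_x \phi_1(x) \phi_2(x)^\top E\left[ \sum_{t=0}^{\tau-1} 1(x_t = x) \right].
\end{aligned}
\quad (17)
\]

Now, note that the last term on the right hand side is an expectation (over all possible trajectories) of the
number of visits to a state $x$ until reaching the terminal state, which is exactly $q(x)$ since

\[
\begin{aligned}
q(x) &= \sum_{t=0}^{\infty} P(x_t = x) \\
&= \sum_{t=0}^{\infty} E[1(x_t = x)] \\
&= E\left[ \sum_{t=0}^{\infty} 1(x_t = x) \right] \\
&= E\left[ \sum_{t=0}^{\tau-1} 1(x_t = x) \right],
\end{aligned}
\]

where the last equality follows from the absorbing property of the terminal state. Similarly, we have

\[ E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_{t+1})^\top \right] = \sum_x \sum_y q(x) P(y|x) \phi_1(x) \phi_2(y)^\top, \quad (18) \]

since

\[
\begin{aligned}
E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_{t+1})^\top \right]
&= E\left[ \sum_{t=0}^{\tau-1} \sum_x \sum_y \phi_1(x) \phi_2(y)^\top 1(x_t = x, x_{t+1} = y) \right] \\
&= E\left[ \sum_x \sum_y \phi_1(x) \phi_2(y)^\top \sum_{t=0}^{\tau-1} 1(x_t = x, x_{t+1} = y) \right] \\
&= \sum_x \sum_y \phi_1(x) \phi_2(y)^\top E\left[ \sum_{t=0}^{\tau-1} 1(x_t = x, x_{t+1} = y) \right],
\end{aligned}
\]

and

\[
\begin{aligned}
q(x) P(y|x) &= \sum_{t=0}^{\infty} P(x_t = x) P(y|x) \\
&= \sum_{t=0}^{\infty} P(x_t = x, x_{t+1} = y) \\
&= \sum_{t=0}^{\infty} E[1(x_t = x, x_{t+1} = y)] \\
&= E\left[ \sum_{t=0}^{\infty} 1(x_t = x, x_{t+1} = y) \right] \\
&= E\left[ \sum_{t=0}^{\tau-1} 1(x_t = x, x_{t+1} = y) \right].
\end{aligned}
\]

Since trajectories between visits to the recurrent state are statistically independent, the law of large numbers
together with the expressions in (17) and (18) suggest that the approximate expressions in (12) converge to their
expected values with probability 1, therefore we have

\[ \hat{w}_J = A_N^{-1} b_N \to A^{-1} b = w^*_J, \]

and

\[ \hat{w}_M = C_N^{-1} d_N \to C^{-1} d = w^*_M. \]

□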



D. Proof of Theorem 10 

Proof. Using (17) and (18) we have for all $k$

\[ E\left[ \sum_{t=0}^{\tau^k - 1} \phi_J(x_t) \delta_J^k(t, w_J, w_M) \right] = \Phi_J^\top Q r - \Phi_J^\top Q (I - P) \Phi_J w_J, \]
\[ E\left[ \sum_{t=0}^{\tau^k - 1} \phi_M(x_t) \delta_M^k(t, w_J, w_M) \right] = \Phi_M^\top Q R \left( r + 2P \Phi_J w_J \right) - \Phi_M^\top Q (I - P) \Phi_M w_M. \quad (19) \]

Letting $\hat{w}_k = (\hat{w}_{J;k}, \hat{w}_{M;k})$ denote a concatenated weight vector in the joint space $\mathbb{R}^{l_J} \times \mathbb{R}^{l_M}$ we can write the
TD algorithm in a stochastic approximation form as

\[ \hat{w}_{k+1} = \hat{w}_k + \xi_k \left( z + M \hat{w}_k + \delta M_{k+1} \right), \quad (20) \]

where

\[
M = \begin{pmatrix} \Phi_J^\top Q (P - I) \Phi_J & 0 \\ 2 \Phi_M^\top Q R P \Phi_J & \Phi_M^\top Q (P - I) \Phi_M \end{pmatrix},
\quad
z = \begin{pmatrix} \Phi_J^\top Q r \\ \Phi_M^\top Q R r \end{pmatrix},
\]

and the noise terms $\delta M_{k+1}$ satisfy

\[ E[\delta M_{n+1} \mid \mathcal{F}_n] = 0, \]

where $\mathcal{F}_n$ is the filtration $\mathcal{F}_n = \sigma(\hat{w}_m, \delta M_m, m \le n)$, since different trajectories are independent.

We first claim that the eigenvalues of $M$ have a negative real part. To see this, observe that $M$ is block
triangular, and its eigenvalues are just the eigenvalues of $\Phi_J^\top Q (P - I) \Phi_J$ and $\Phi_M^\top Q (P - I) \Phi_M$. By Lemma
6.10 in (Bertsekas & Tsitsiklis, 1996) these matrices are negative definite. It therefore follows (see Bertsekas,
2012, Example 6.6) that their eigenvalues have a negative real part. Thus, the eigenvalues of $M$ have a negative
real part.

Next, let $h(w) = Mw + z$, and observe that the following conditions hold.

A 1. The map $h$ is Lipschitz.

A 2. The step sizes satisfy

\[ \sum_{k=0}^{\infty} \xi_k = \infty, \quad \sum_{k=0}^{\infty} \xi_k^2 < \infty. \]

A 3. $\{\delta M_n\}$ is a martingale difference sequence, i.e., $E[\delta M_{n+1} \mid \mathcal{F}_n] = 0$.

The next condition also holds

A 4. The functions $h_c(w) \triangleq h(cw)/c$, $c \ge 1$ satisfy $h_c(w) \to h_\infty(w)$ as $c \to \infty$, uniformly on compacts, and
$h_\infty(w)$ is continuous. Furthermore, the Ordinary Differential Equation (ODE)

\[ \dot{w}(t) = h_\infty(w(t)) \]

has the origin as its unique globally asymptotically stable equilibrium.

This is easily verified by noting that $h(cw)/c = Mw + c^{-1} z$, and since $z$ is finite, $h_c(w)$ converges uniformly as
$c \to \infty$ to $h_\infty(w) = Mw$. The stability of the origin is guaranteed since the eigenvalues of $M$ have a negative
real part.

Theorem 7 in Chapter 3 of (Borkar, 2008) states that if A1 - A4 hold, the following condition holds

A 5. The iterates of (20) remain bounded almost surely, i.e., $\sup_k \|\hat{w}_k\| < \infty$, a.s.

Finally, we use a standard stochastic approximation result that, given that the above conditions hold, relates
the convergence of the iterates of (20) with the asymptotic behavior of the ODE

\[ \dot{w}(t) = h(w(t)). \quad (21) \]

Since the eigenvalues of $M$ have a negative real part, (21) has a unique globally asymptotically stable equilibrium
point, which by (10) is exactly $w^* = (w^*_J, w^*_M)$. Formally, by Theorem 2 in Chapter 2 of (Borkar, 2008) we have
that if A1 - A3 and A5 hold, then $\hat{w}_k \to w^*$ as $k \to \infty$ with probability 1. □



